k-means clustering
CAS Condensed and Accelerated Silhouette: An Efficient Method for Determining the Optimal K in K-Means Clustering
Das, Krishnendu, Gupta, Sumit, Kumar, Awadhesh
Clustering is a critical component of decision-making in today's data-driven environments. Clustering has been widely used in a variety of fields, such as bioinformatics, social network analysis, and image processing. However, clustering accuracy remains a major challenge in large datasets. This paper presents a comprehensive overview of strategies for selecting the optimal K in clustering, with a focus on achieving a balance between clustering precision and computational efficiency in complex data environments. In addition, this paper introduces improvements to clustering techniques relating to text and image data to provide insights into better computational performance and cluster validity. The proposed approach is an algorithm based on the Condensed Silhouette method and on statistical methods such as Local Structures, Gap Statistics, and the Class-Consistency Ratio and Cluster Overlap Index (CCR-COI), and it calculates the best value of K for K-means clustering. The results of comparative experiments show that the proposed approach achieves up to 99% faster execution times on high-dimensional datasets while retaining both precision and scalability, making it highly suitable for real-time clustering needs or scenarios demanding efficient clustering with minimal resource utilization. Clustering is a critical component of unsupervised machine learning, with the K-means algorithm being particularly favored due to its simplicity, speed, and interpretability. Nonetheless, a major difficulty lies in accurately identifying the best number of clusters, K, especially with large, high-dimensional datasets where it is crucial to strike an effective balance between computational efficiency and accuracy.
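For orientation, the baseline that silhouette-based methods build on can be sketched as follows. This is a minimal illustration of choosing K by the plain (uncondensed) mean silhouette score, not the CAS algorithm described in the abstract. The function names, the farthest-first initialization, and the iteration count are assumptions made for this sketch.

```python
import math


def farthest_first_centers(points, k):
    """Deterministic seeding (an assumption of this sketch, not k-means++)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans_labels(points, k, iters=25):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels


def silhouette(points, labels):
    """Exact mean silhouette coefficient, using O(n^2) distance evaluations."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # the score is undefined for a single cluster
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_values):
    """Return the K with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans_labels(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores
```

Calling `best_k(points, range(2, 7))` on a list of coordinate tuples returns the selected K and the per-K scores. The quadratic cost of the exact silhouette is the kind of overhead that faster variants aim to reduce.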
- Research Report (1.00)
- Overview (1.00)
Reviews: k-Means Clustering of Lines for Big Data
The authors consider the problem of clustering a set of lines in R^d. The goal is to minimize the k-means objective: given a set L of n lines in R^d, find the best set of k points c_1,...,c_k in R^d so as to minimize sum_{l in L} min_{c_i} dist(c_i, l)^2. This is a clean, nicely motivated problem. The authors provide a coreset construction (namely, a small summary of the input such that any alpha-approximation for the summary yields an alpha(1 + epsilon)-approximation for the entire input). This implies the first (1 + epsilon)-approximation for the problem with running time nd exp(poly(k)), together with a streaming algorithm with similar running time and memory size 2^{poly(k)} log n. En route to the result, the authors provide a bicriteria approximation algorithm: namely, a solution that contains O(k (log n dk log k)) centers and whose cost is at most 4 times the cost of the optimal solution with k centers.
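The objective stated in the review can be made concrete with a short sketch. It only evaluates the cost of a given set of centers and does not implement the coreset or the approximation algorithm. Representing each line as a pair (point on the line, direction vector) is an assumption of this example, as are the function names.

```python
def sq_dist_point_line(center, line):
    """Squared Euclidean distance from a point to a line in R^d.

    The line is given as (a, u): a point a on the line and a nonzero
    direction vector u (a representation assumed for this sketch).
    """
    a, u = line
    diff = [ci - ai for ci, ai in zip(center, a)]
    uu = sum(x * x for x in u)
    proj = sum(x * y for x, y in zip(diff, u))
    # Pythagoras: remove the squared component of diff along the line.
    return max(sum(x * x for x in diff) - proj * proj / uu, 0.0)


def lines_kmeans_cost(lines, centers):
    """Sum over the lines of the squared distance to the nearest center."""
    return sum(min(sq_dist_point_line(c, line) for c in centers) for line in lines)
```

For example, with the x-axis, the line y = 2, and the line x = 5 in the plane, the single center (0, 1) is at squared distance 1, 1, and 25 from the three lines, so adding a second center on the third line removes the largest term.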
Reviews: k-Means Clustering of Lines for Big Data
This paper proposes a PTAS for k-means clustering of lines. The key contribution is the construction of a small coreset, on which brute-force algorithms are run. The authors also extend this to the streaming setting. An important computer vision application is used as motivation. The authors should revise the final version to address the issues raised by the reviewers and make it more readable to researchers in related areas who do not work on exactly this topic.
k-Means Clustering of Lines for Big Data
The input to the \emph{$k$-mean for lines} problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum of squared distances over every line in $L$ and its nearest center. This is a straightforward generalization of the $k$-mean problem where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$, and constant integers $k, d \geq 1$. This is by proving that there is always a weighted subset (called coreset) of $dk^{O(k)}\log (n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to \emph{any} given set of $k$ points. Using traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g.
QoS-Nets: Adaptive Approximate Neural Network Inference
Trommer, Elias, Waschneck, Bernd, Kumar, Akash
In order to vary the arithmetic resource consumption of neural network applications at runtime, this work proposes the flexible reuse of approximate multipliers for neural network layer computations. We introduce a search algorithm that chooses an appropriate subset of approximate multipliers of a user-defined size from a larger search space and enables retraining to maximize task performance. Unlike previous work, our approach can output more than a single, static assignment of approximate multiplier instances to layers. These different operating points allow a system to gradually adapt its Quality of Service (QoS) to changing environmental conditions by increasing or decreasing its accuracy and resource consumption. QoS-Nets achieves this by reassigning the selected approximate multiplier instances to layers at runtime. To combine multiple operating points with the use of retraining, we propose a fine-tuning scheme that shares the majority of parameters between operating points, with only a small number of additional parameters required per operating point. In our evaluation on MobileNetV2, QoS-Nets is used to select four approximate multiplier instances for three different operating points. These operating points result in power savings for multiplications between 15.3% and 42.8% at a Top-5 accuracy loss between 0.3 and 2.33 percentage points. Through our fine-tuning scheme, all three operating points together increase the model's parameter count by only 2.75%.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Germany > Saxony > Dresden (0.04)
- North America > Puerto Rico > San Juan > San Juan (0.04)
CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding
Samanta, Riya, Saha, Bidyut, Ghosh, Soumya K., Das, Sajal K.
Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, there are certain critical limitations to such models. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, the skillset in freelancer profiles is one such attribute, where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume large amounts of memory and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with K-Means Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a real-world freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework's system feasibility. CTGKrEW also takes around 99% less CPU time and has a 33% smaller memory footprint than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at https://riyasamanta.github.io/krew.html, is freely accessible to both the general public and the research community.
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- North America > United States > Missouri (0.04)
- North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.68)
Transforming Movie Recommendations with Advanced Machine Learning: A Study of NMF, SVD, and K-Means Clustering
Yan, Yubing, Moreau, Camille, Wang, Zhuoyue, Fan, Wenhan, Fu, Chengqian
Keywords: recommendation system; machine learning; Non-[...] The proliferation of digital content has necessitated the development of effective recommendation systems to aid users in navigating vast amounts of data. This research aims to explore and implement advanced machine learning techniques [1-6] to create a high-performing movie [...] The study addresses the following research questions: What are the most effective machine [...] How do these models compare in terms of accuracy and relevance? [...] groups based on their viewing patterns. [...] Agent Recurrent Deterministic Policy Gradient (MA-RDPG) algorithm, as suggested by Zhao et al., this research aims to optimize overall system performance through enhanced [...] Previous studies have extensively explored collaborative filtering techniques for recommendation systems. [...] (2001) [13] demonstrated the effectiveness of matrix factorization in uncovering latent user-item interactions. [...] et al. (2009) [14] further refined these techniques, leading to [...]
- North America > United States > New York (0.05)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Research Report > New Finding (0.49)
- Research Report > Experimental Study (0.34)
A Review of Machine Learning-based Security in Cloud Computing
Babaei, Aptin, Kebria, Parham M., Dalvand, Mohsen Moradi, Nahavandi, Saeid
Cloud Computing (CC) is revolutionizing the way IT resources are delivered to users, allowing them to access and manage their systems with increased cost-effectiveness and simplified infrastructure. However, with the growth of CC comes a host of security risks, including threats to availability, integrity, and confidentiality. To address these challenges, Machine Learning (ML) is increasingly being used by Cloud Service Providers (CSPs) to reduce the need for human intervention in identifying and resolving security issues. With the ability to analyze vast amounts of data, and make high-accuracy predictions, ML can transform the way CSPs approach security. In this paper, we will explore some of the most recent research in the field of ML-based security in Cloud Computing. We will examine the features and effectiveness of a range of ML algorithms, highlighting their unique strengths and potential limitations. Our goal is to provide a comprehensive overview of the current state of ML in cloud security and to shed light on the exciting possibilities that this emerging field has to offer.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States (0.04)
- Asia > China (0.04)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Clustering an African Hairstyle Dataset Using PCA and K-Means
Nicrocia, Teffo Phomolo, Adewale, Owolawi Pius, Diana, Pholo Moanda
The adoption of digital transformation has not yet extended to building an African face shape classifier. African women rely on beauty standards recommendations, personal preference, or the newest trends in hairstyles to decide on the appropriate hairstyle for them. In this paper, an approach is presented that uses K-means clustering to classify African women's images. In order to identify potential facial clusters, Haarcascade is used for feature-based training, and K-means clustering is applied for image classification.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- Africa > South Africa > Gauteng > Pretoria (0.05)
- Europe > Italy (0.04)
K-means Clustering Based Feature Consistency Alignment for Label-free Model Evaluation
Miao, Shuyu, Zheng, Lin, Liu, Jingjing, Jin, Hong
Label-free model evaluation aims to predict model performance on various test sets without relying on ground truths. The main challenge of this task is the absence of labels in the test data, unlike in classical supervised model evaluation. This paper presents our solutions for the 1st DataCV Challenge of the Visual Dataset Understanding workshop at CVPR 2023. Firstly, we propose a novel method called K-means Clustering Based Feature Consistency Alignment (KCFCA), which is tailored to handle the distribution shifts of various datasets. KCFCA utilizes the K-means algorithm to cluster labeled training sets and unlabeled test sets, and then aligns the cluster centers with feature consistency. Secondly, we develop a dynamic regression model to capture the relationship between the shifts in distribution and model accuracy. Thirdly, we design an algorithm to discover the outlier model factors, eliminate the outlier models, and combine the strengths of multiple autoeval models. On the DataCV Challenge leaderboard, our approach secured 2nd place with an RMSE of 6.8526. Our method significantly improved over the best baseline method by 36% (6.8526 vs. 10.7378). Furthermore, our method achieves relatively more robust and optimal single-model performance on the validation dataset.
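The reported 36% improvement can be checked from the two RMSE values, assuming it denotes the relative reduction in RMSE with respect to the best baseline (the abstract does not state the formula):

```latex
\[
\frac{10.7378 - 6.8526}{10.7378} = \frac{3.8852}{10.7378} \approx 0.362
\]
```

Under that assumption the figures are consistent with the stated 36%.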
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.47)